Project - Feature Selection, Model Selection and Tuning

Problem Statement: Concrete Strength Prediction

Objective

To predict the concrete strength using the data available in file "concrete.csv". Apply feature engineering and model tuning to obtain a score above 85%.

Steps and Tasks:

- Exploratory Data Quality Report Reflecting the Following:

  1. Univariate analysis –data types and description of the independent attributes which should include (name, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions/tails, missing values, outliers, duplicates(10 Marks).
  1. Bi-variate analysis between the predictor variables and between the predictor variables and target column. Comment on your findings in terms of their relationship and degree of relation if any. Visualize the analysis using boxplots and pair plots, histograms, or density curves. (10 marks).
  1. Feature Engineering techniques(10 marks):

    3.1. Identify opportunities (if any) to extract new features from existing features, drop a feature(if required) Hint: Feature Extraction, for example, consider a dataset with two features length and breadth. From this, we can extract a new feature Area which would be length * breadth.

    3.2 Get the data model ready and do a train test split.

    3.3 Decide on the complexity of the model, should it be a simple linear model in terms of parameters or would a quadratic or higher degree.

- Creating the Model and Tuning It:

  1. Algorithms that you think will be suitable for this project. Use Kfold Cross-Validation to evaluate model performance. Use appropriate metrics and make a DataFrame to compare models w.r.t their metrics. (at least 3 algorithms, one bagging and one boosting based algorithms have to be there). (15 marks).
  1. Techniques employed to squeeze that extra performance out of the model without making it overfit. Use Grid Search or Random Search on any of the two models used above. Make a DataFrame to compare models after hyperparameter tuning and their metrics as above. (15 marks).

Attribute Information:

Given are the variable name, variable type, the measurement unit, and a brief description. The concrete compressive strength is the regression problem. The order of this listing corresponds to the order of numerals along the rows of the database.

Input variables:

    Name        -- Data Type --     Measurement -- Description
  1. Cement (cement) -- quantitative -- kg in a m3 mixture -- Input Variable
  2. Blast Furnace Slag (slag) -- quantitative -- kg in a m3 mixture -- Input Variable
  3. Fly Ash (ash) -- quantitative -- kg in a m3 mixture -- Input Variable
  4. Water(water) -- quantitative -- kg in a m3 mixture -- Input Variable
  5. Superplasticizer (superplastic)-quantitative -- kg in a m3 mixture -- Input Variable
  6. Coarse Aggregate (coarseagg) -- quantitative -- kg in a m3 mixture -- Input Variable
  7. Fine Aggregate (fineagg) -- quantitative -- kg in a m3 mixture -- Input Variable
  8. Age(age) -- quantitative -- Day (1~365) -- Input Variable

Output variable (desired target):

  1. Concrete compressive strength(strength) -- quantitative -- MPa -- Output Variable

For this project I will use the experience from previous project as a template. This will allows me to move faster in the initial part of the project until I create a Base model. After this I will apply changes for tunning the model and get better results. Hence most of the intial steps (plots, graphics, functions,etc) will be similar to the one used in previous projects.

In previous project I was not able to apply modification due to limit of time, but as seen in the mentored session, applying a templete is a good strategy to be more efficient and get better results.

Having a base model with minimal modification in the data set will let us have a reference line to compare later on. Additionally we can come back and improve the Engineering and run the model again but the output will be a different model name

For this Project i will try to apply more structural steps, it is to say, separete the plot section from graphics, in order to move easily along the document. In previous projects after many lines it was difficult to move along the project.

Import Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas as pd #Read files
import numpy as np # numerical libraries


# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
#%matplotlib inline 


from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from IPython.display import Image  
from os import system
In [36]:
pd.options.display.float_format = '{:,.4f}'.format
In [4]:
# Below we will read the data from the local folder
df = pd.read_csv("concrete.csv")

# Now display the header 
print ('concrete.csv data set:')
df.head(10)
concrete.csv data set:
Out[4]:
cement slag ash water superplastic coarseagg fineagg age strength
0 141.3000 212.0000 0.0000 203.5000 0.0000 971.8000 748.5000 28 29.8900
1 168.9000 42.2000 124.3000 158.3000 10.8000 1,080.8000 796.2000 14 23.5100
2 250.0000 0.0000 95.7000 187.4000 5.5000 956.9000 861.2000 28 29.2200
3 266.0000 114.0000 0.0000 228.0000 0.0000 932.0000 670.0000 28 45.8500
4 154.8000 183.4000 0.0000 193.3000 9.1000 1,047.4000 696.7000 28 18.2900
5 255.0000 0.0000 0.0000 192.0000 0.0000 889.8000 945.0000 90 21.8600
6 166.8000 250.2000 0.0000 203.5000 0.0000 975.6000 692.6000 7 15.7500
7 251.4000 0.0000 118.3000 188.5000 6.4000 1,028.4000 757.7000 56 36.6400
8 296.0000 0.0000 0.0000 192.0000 0.0000 1,085.0000 765.0000 28 21.6500
9 155.0000 184.0000 143.0000 194.0000 9.0000 880.0000 699.0000 28 28.9900
In [6]:
df.tail() ## to know how the end of the data looks like
Out[6]:
cement slag ash water superplastic coarseagg fineagg age strength
1025 135.0000 0.0000 166.0000 180.0000 10.0000 961.0000 805.0000 28 13.2900
1026 531.3000 0.0000 0.0000 141.8000 28.2000 852.1000 893.7000 3 41.3000
1027 276.4000 116.0000 90.3000 179.6000 8.9000 870.1000 768.3000 28 44.2800
1028 342.0000 38.0000 0.0000 228.0000 0.0000 932.0000 670.0000 270 55.0600
1029 540.0000 0.0000 0.0000 173.0000 0.0000 1,125.0000 613.0000 7 52.6100
  • ## 1.1 Univariate analysis
In [12]:
df.info() # here we will see the number of entires(rows and columns), dtype, and non-nullcount
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cement        1030 non-null   float64
 1   slag          1030 non-null   float64
 2   ash           1030 non-null   float64
 3   water         1030 non-null   float64
 4   superplastic  1030 non-null   float64
 5   coarseagg     1030 non-null   float64
 6   fineagg       1030 non-null   float64
 7   age           1030 non-null   int64  
 8   strength      1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
In [22]:
print(f"The given dataset contains {df.shape[0]} rows and {df.shape[1]} columns")
print(f"The given dataset contains {df.isna().sum().sum()} Null value")
The given dataset contains 1030 rows and 9 columns
The given dataset contains 0 Null value

Insight # 1:

  • There are no non-null values, i.e., there is not missing values since we have a value for every column and row.
In [13]:
df.shape # size of the data set also shown in the cell above
Out[13]:
(1030, 9)
In [15]:
neg_exp=df[df.lt(0)] # this is to see the number of negative values present 
print (" the number of negative entries is",len(neg_exp.index)) 
# the output might be taken in consideration later on in the calculations.
 the number of negative entries is 1030
In [16]:
df.describe().transpose() # Transpose is used here  to read better the attribute
Out[16]:
count mean std min 25% 50% 75% max
cement 1,030.00 281.17 104.51 102.00 192.38 272.90 350.00 540.00
slag 1,030.00 73.90 86.28 0.00 0.00 22.00 142.95 359.40
ash 1,030.00 54.19 64.00 0.00 0.00 0.00 118.30 200.10
water 1,030.00 181.57 21.35 121.80 164.90 185.00 192.00 247.00
superplastic 1,030.00 6.20 5.97 0.00 0.00 6.40 10.20 32.20
coarseagg 1,030.00 972.92 77.75 801.00 932.00 968.00 1,029.40 1,145.00
fineagg 1,030.00 773.58 80.18 594.00 730.95 779.50 824.00 992.60
age 1,030.00 45.66 63.17 1.00 7.00 28.00 56.00 365.00
strength 1,030.00 35.82 16.71 2.33 23.71 34.45 46.14 82.60
In [17]:
df.nunique() # Number of unique values in a column 
# this help to identify categorical values.
Out[17]:
cement          278
slag            185
ash             156
water           195
superplastic    111
coarseagg       284
fineagg         302
age              14
strength        845
dtype: int64

Insight # 2:

  • Based on the results above most of the variables are discrete imputs and maybe 1 categorical which is age
  • age has a wide range of values (365) and the 75% is low compare with he max value, indicating outliers
In [20]:
 # Now we will get a list of unique values to evalaute how to arrange the data set
for a in list(df.columns):
    n = df[a].unique()
    
    # if number of unique values is less than 30, print the values. Otherwise print the number of unique values
    if len(n)<30:
        print(a + ': ')
        print(df[a].value_counts(normalize=True))
        print()
    else:
        print(a + ': ' +str(len(n)) + ' unique values')
        print()
cement: 278 unique values

slag: 185 unique values

ash: 156 unique values

water: 195 unique values

superplastic: 111 unique values

coarseagg: 284 unique values

fineagg: 302 unique values

age: 
28    0.4126
3     0.1301
7     0.1223
56    0.0883
14    0.0602
90    0.0524
100   0.0505
180   0.0252
91    0.0214
365   0.0136
270   0.0126
360   0.0058
120   0.0029
1     0.0019
Name: age, dtype: float64

strength: 845 unique values

Insight # 3:

  • 42% the inputs in age are of 28 days and the the other 2 relevant numbers are 3 and 7 days. We could try to group them to reduce the categorical numbers, for example: less than 3 days, and >90 days. This will we evaluated later on
    • ### Box plots
In [11]:
plt.subplots(figsize=(20, 20))
ax = sns.boxplot(data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45);

Insight # 4:

  • Age As expected (Insight 2) is showing more amont of outliers and also less outliers in other columns: Slag, water, superplastic and fineagg
  • the pdays column seems to have continue values as seen in the boxplot and maybe will no add any value to the calculation.
  • these boxplot will be evalauted with histograms below
    • ### Histograms

Here I will check each variable to evaluate the body of distributions/tails

In [40]:
df.hist(stacked=False, bins=50, figsize=(30,30), layout=(3,3)); # Histogram of the tentative continous variable
# ***Please notice that some of these variable can be dropped after the graphical evaluation.
In [52]:
## Please notice: I found this commands on internet and it allows a better and faster visualitzation of multiples displot, 
## compared with how i did it in the previous project and with the line above.
#### With this code we can also see the mean of each variable which is better than the histogram plotted above.

import itertools 
import statistics 

cols = [i for i in df.columns]

fig = plt.figure(figsize=(20, 25))

for i,j in itertools.zip_longest(cols, range(len(cols))):
    plt.subplot(5,2,j+1)
    ax = sns.distplot(df[i],color='blue',rug=True)
    plt.axvline(df[i].mean(),linestyle="dashed",label="mean", color='black')
    plt.axvline(statistics.mode(df[i]),linestyle="dashed",label="Mode", color='Red')
    plt.axvline(statistics.median(df[i]),linestyle="dashed",label="MEDIAN", color='Green')
    plt.legend()
    plt.title(i)
    plt.xlabel("")

Insight # 5:

  • Slag, Ash, Superlastic and Age are strongly positive sknewness where the median is greater than the mode, except in age where the mode and median are almost the same.
  • the skewness in age confirm the outliers mentioned in the Insight #4
  • ## 1.2 Bi-variate analysis
    • ### Corrleation

Here I will check the correlation between variables

In [10]:
df.corr() # with this function will try to see a correlation between variables numerically 
Out[10]:
cement slag ash water superplastic coarseagg fineagg age strength
cement 1.0000 -0.2752 -0.3975 -0.0816 0.0924 -0.1093 -0.2227 0.0819 0.4978
slag -0.2752 1.0000 -0.3236 0.1073 0.0433 -0.2840 -0.2816 -0.0442 0.1348
ash -0.3975 -0.3236 1.0000 -0.2570 0.3775 -0.0100 0.0791 -0.1544 -0.1058
water -0.0816 0.1073 -0.2570 1.0000 -0.6575 -0.1823 -0.4507 0.2776 -0.2896
superplastic 0.0924 0.0433 0.3775 -0.6575 1.0000 -0.2660 0.2227 -0.1927 0.3661
coarseagg -0.1093 -0.2840 -0.0100 -0.1823 -0.2660 1.0000 -0.1785 -0.0030 -0.1649
fineagg -0.2227 -0.2816 0.0791 -0.4507 0.2227 -0.1785 1.0000 -0.1561 -0.1672
age 0.0819 -0.0442 -0.1544 0.2776 -0.1927 -0.0030 -0.1561 1.0000 0.3289
strength 0.4978 0.1348 -0.1058 -0.2896 0.3661 -0.1649 -0.1672 0.3289 1.0000

Insight # 6:

  • Strength has higher correlation (>= 30%) with Cement,Superplastic, age and 29% with water(negative correlation)
  • Strenght has a minor correlation (<1%) with the rest of variables.
  • I will try to test a model without slag and ash since they seems to have low impact in the target variable
  • Cement and ashshows a good negative correlation ~ -40%
  • Water and superplastic have a strong negative correlation of -65%
    • ### Pair Plots
In [56]:
g = sns.PairGrid(df)
g.map_upper(plt.scatter)
g.map_lower(sns.lineplot)
g.map_diag(sns.kdeplot, lw=3, legend=True);
In [8]:
sns.pairplot(df , hue='age' , diag_kind = 'kde')
plt.show()

Insight # 7

  • In the 1st Pair plot above:
    • here we see, in the diagonal, a almost normal distribution for cement, water, coarseagg,fineagg and strngth
    • Also we see a double gaussian (explained in the lectures) + skewness ( insight #5) in slag, ash, superlastic and age
    • we can also confirm graphically the observations in the Insight #6 about the correlation beteween the target variable and between other variables, however I found it easier to read the number than this graphics
  • In the 2nd paiplot:
    • Here we see that we actually have different gaussians in the diagonal for every variable if we split the age. Hence it will be interesting to change the age variable in smaller groups is evaluate the results later on
    • in this pairplot: Ash, superplastic and slag variable show a more clear the double gaussian
    • ### Heatmap
In [23]:
#Another correlation methods
plt.figure(figsize=(10,10))
mask = np.zeros_like(df.corr('spearman'))
mask[np.triu_indices_from(mask)] = True
ax =sns.heatmap(df.corr(),
            annot=True,
            linewidths=.5,
            center=0,
            cmap="YlGnBu",
            mask= mask,
           )
ax.set_xticklabels(ax.get_xticklabels(), rotation=45);
plt.show()

Insight # 8

  • With this plot it is easy to recongnise the findings observed in the Insight #6 where it takes longer to find the correlation between the variables
  • Additional to Insights #6, here we also find correlation between other variables positive and negatively but less than 30%. this plot can be reviewed later on to determine the relevance of some variables
  • It is confirmed that the 3 most important variables for the target are: cement, superplastic and age
  • ## 1.3 Pandas Profilling
  • ### I will run Pandas profiling just to confirm what mentioned in the previous lines. This code was seen in the latest mentor session after all the exploration analisys was done but I want to confirm the findings so it can be used for futures projects to save some time. Also this step can be added as a common practice.
In [7]:
#pip install pandas-profiling[notebook] 
Collecting pandas-profiling[notebook]
  Downloading pandas_profiling-2.9.0-py2.py3-none-any.whl (258 kB)
Collecting visions[type_image_path]==0.5.0
  Downloading visions-0.5.0-py3-none-any.whl (64 kB)
Requirement already satisfied: scipy>=1.4.1 in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (1.5.0)
Collecting htmlmin>=0.1.12
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
Requirement already satisfied: requests>=2.23.0 in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (2.24.0)
Note: you may need to restart the kernel to use updated packages.Requirement already satisfied: matplotlib>=3.2.0 in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (3.2.2)

Requirement already satisfied: numpy>=1.16.0 in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (1.18.5)
Collecting phik>=0.9.10
  Downloading phik-0.10.0-py3-none-any.whl (599 kB)
Requirement already satisfied: jinja2>=2.11.1 in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (2.11.2)
Requirement already satisfied: pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3 in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (1.0.5)
Requirement already satisfied: tqdm>=4.43.0 in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (4.47.0)
Collecting confuse>=1.0.0
  Downloading confuse-1.4.0-py2.py3-none-any.whl (21 kB)
Requirement already satisfied: joblib in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (0.16.0)
Collecting tangled-up-in-unicode>=0.0.6
  Downloading tangled_up_in_unicode-0.0.6-py3-none-any.whl (3.1 MB)
Requirement already satisfied: ipywidgets>=7.5.1 in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (7.5.1)
Collecting missingno>=0.4.2
  Downloading missingno-0.4.2-py3-none-any.whl (9.7 kB)
Requirement already satisfied: seaborn>=0.10.1 in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (0.10.1)
Requirement already satisfied: attrs>=19.3.0 in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (19.3.0)
Requirement already satisfied: jupyter-client>=6.0.0; extra == "notebook" in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (6.1.6)
Requirement already satisfied: jupyter-core>=4.6.3; extra == "notebook" in c:\users\pedro\anaconda3\lib\site-packages (from pandas-profiling[notebook]) (4.6.3)
Requirement already satisfied: networkx>=2.4 in c:\users\pedro\anaconda3\lib\site-packages (from visions[type_image_path]==0.5.0->pandas-profiling[notebook]) (2.4)
Collecting imagehash; extra == "type_image_path"
  Downloading ImageHash-4.2.0-py2.py3-none-any.whl (295 kB)
Requirement already satisfied: Pillow; extra == "type_image_path" in c:\users\pedro\anaconda3\lib\site-packages (from visions[type_image_path]==0.5.0->pandas-profiling[notebook]) (7.2.0)
Requirement already satisfied: chardet<4,>=3.0.2 in c:\users\pedro\anaconda3\lib\site-packages (from requests>=2.23.0->pandas-profiling[notebook]) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\users\pedro\anaconda3\lib\site-packages (from requests>=2.23.0->pandas-profiling[notebook]) (1.25.9)
Requirement already satisfied: idna<3,>=2.5 in c:\users\pedro\anaconda3\lib\site-packages (from requests>=2.23.0->pandas-profiling[notebook]) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\pedro\anaconda3\lib\site-packages (from requests>=2.23.0->pandas-profiling[notebook]) (2020.6.20)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\pedro\anaconda3\lib\site-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (1.2.0)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\pedro\anaconda3\lib\site-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (2.8.1)
Requirement already satisfied: cycler>=0.10 in c:\users\pedro\anaconda3\lib\site-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\pedro\anaconda3\lib\site-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (2.4.7)
Requirement already satisfied: numba>=0.38.1 in c:\users\pedro\anaconda3\lib\site-packages (from phik>=0.9.10->pandas-profiling[notebook]) (0.50.1)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\pedro\anaconda3\lib\site-packages (from jinja2>=2.11.1->pandas-profiling[notebook]) (1.1.1)
Requirement already satisfied: pytz>=2017.2 in c:\users\pedro\anaconda3\lib\site-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3->pandas-profiling[notebook]) (2020.1)
Requirement already satisfied: pyyaml in c:\users\pedro\anaconda3\lib\site-packages (from confuse>=1.0.0->pandas-profiling[notebook]) (5.3.1)
Requirement already satisfied: nbformat>=4.2.0 in c:\users\pedro\anaconda3\lib\site-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.0.7)
Requirement already satisfied: ipython>=4.0.0; python_version >= "3.3" in c:\users\pedro\anaconda3\lib\site-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (7.16.1)
Requirement already satisfied: traitlets>=4.3.1 in c:\users\pedro\anaconda3\lib\site-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (4.3.3)
Requirement already satisfied: ipykernel>=4.5.1 in c:\users\pedro\anaconda3\lib\site-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.3.2)
Requirement already satisfied: widgetsnbextension~=3.5.0 in c:\users\pedro\anaconda3\lib\site-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (3.5.1)
Requirement already satisfied: tornado>=4.1 in c:\users\pedro\anaconda3\lib\site-packages (from jupyter-client>=6.0.0; extra == "notebook"->pandas-profiling[notebook]) (6.0.4)
Requirement already satisfied: pyzmq>=13 in c:\users\pedro\anaconda3\lib\site-packages (from jupyter-client>=6.0.0; extra == "notebook"->pandas-profiling[notebook]) (19.0.1)
Requirement already satisfied: pywin32>=1.0; sys_platform == "win32" in c:\users\pedro\anaconda3\lib\site-packages (from jupyter-core>=4.6.3; extra == "notebook"->pandas-profiling[notebook]) (227)
Requirement already satisfied: decorator>=4.3.0 in c:\users\pedro\anaconda3\lib\site-packages (from networkx>=2.4->visions[type_image_path]==0.5.0->pandas-profiling[notebook]) (4.4.2)
Requirement already satisfied: six in c:\users\pedro\anaconda3\lib\site-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.5.0->pandas-profiling[notebook]) (1.15.0)
Requirement already satisfied: PyWavelets in c:\users\pedro\anaconda3\lib\site-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.5.0->pandas-profiling[notebook]) (1.1.1)
Requirement already satisfied: llvmlite<0.34,>=0.33.0.dev0 in c:\users\pedro\anaconda3\lib\site-packages (from numba>=0.38.1->phik>=0.9.10->pandas-profiling[notebook]) (0.33.0+1.g022ab0f)
Requirement already satisfied: setuptools in c:\users\pedro\anaconda3\lib\site-packages (from numba>=0.38.1->phik>=0.9.10->pandas-profiling[notebook]) (49.2.0.post20200714)
Requirement already satisfied: ipython-genutils in c:\users\pedro\anaconda3\lib\site-packages (from nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.2.0)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in c:\users\pedro\anaconda3\lib\site-packages (from nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (3.2.0)
Requirement already satisfied: backcall in c:\users\pedro\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.2.0)
Requirement already satisfied: pickleshare in c:\users\pedro\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.7.5)
Requirement already satisfied: colorama; sys_platform == "win32" in c:\users\pedro\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.4.3)
Requirement already satisfied: jedi>=0.10 in c:\users\pedro\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.17.1)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in c:\users\pedro\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (3.0.5)
Requirement already satisfied: pygments in c:\users\pedro\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (2.6.1)
Requirement already satisfied: notebook>=4.4.1 in c:\users\pedro\anaconda3\lib\site-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (6.0.3)
Requirement already satisfied: pyrsistent>=0.14.0 in c:\users\pedro\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.16.0)
Requirement already satisfied: parso<0.8.0,>=0.7.0 in c:\users\pedro\anaconda3\lib\site-packages (from jedi>=0.10->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.7.0)
Requirement already satisfied: wcwidth in c:\users\pedro\anaconda3\lib\site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.2.5)
Requirement already satisfied: Send2Trash in c:\users\pedro\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.5.0)
Requirement already satisfied: nbconvert in c:\users\pedro\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.6.1)
Requirement already satisfied: prometheus-client in c:\users\pedro\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.8.0)
Requirement already satisfied: terminado>=0.8.1 in c:\users\pedro\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.8.3)
Requirement already satisfied: entrypoints>=0.2.2 in c:\users\pedro\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.3)
Requirement already satisfied: pandocfilters>=1.4.1 in c:\users\pedro\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.4.2)
Requirement already satisfied: mistune<2,>=0.8.1 in c:\users\pedro\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.8.4)
Requirement already satisfied: testpath in c:\users\pedro\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.4.4)
Requirement already satisfied: defusedxml in c:\users\pedro\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.6.0)
Requirement already satisfied: bleach in c:\users\pedro\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (3.1.5)
Requirement already satisfied: webencodings in c:\users\pedro\anaconda3\lib\site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.5.1)
Requirement already satisfied: packaging in c:\users\pedro\anaconda3\lib\site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (20.4)
Building wheels for collected packages: htmlmin
  Building wheel for htmlmin (setup.py): started
  Building wheel for htmlmin (setup.py): finished with status 'done'
  Created wheel for htmlmin: filename=htmlmin-0.1.12-py3-none-any.whl size=27090 sha256=590c99874d455a84ea31183637968719be16ae560abdfb805079df39690157d6
  Stored in directory: c:\users\pedro\appdata\local\pip\cache\wheels\23\14\6e\4be5bfeeb027f4939a01764b48edd5996acf574b0913fe5243
Successfully built htmlmin
Installing collected packages: tangled-up-in-unicode, imagehash, visions, htmlmin, phik, confuse, missingno, pandas-profiling
Successfully installed confuse-1.4.0 htmlmin-0.1.12 imagehash-4.2.0 missingno-0.4.2 pandas-profiling-2.9.0 phik-0.10.0 tangled-up-in-unicode-0.0.6 visions-0.5.0
In [152]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df)

profile



Out[152]:

Insight # 9

  • The main takeaway of this step is that for the next projects I will apply it at the beggning of the study. It give an idea of where to look at to see more details, like relevant correlations and more important variables
  • Most of the observations from this report about the data were mentioned already
  • ## 2.1 Identify new features
  • No new features sin all the variables have the same unit (kg/cc)
  • the outliers will be treated in the following step (2.3.b)
  • We will drop ash', 'coarseagg', 'fineagg' since they have low correlation
  • Age will be regropued to treat outliers in the step 2.3.b
  • The reason for the double gaussian (Insight # 7) might be due to the presence of outliers and Age, as mentioned before this will be trated in 2.3.b
  • ## 2.2 Get the data model ready
In [5]:
X = df.drop('strength', axis=1)  # Seperating the target and the rest
Y = df['strength']
In [6]:
#from sklearn.model_selection import train_test_split # Splitting the data for training and testing out model

##Split into training and test set
x_train, x_test,y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1) # 1 is just any random seed number
print (x_train.shape, x_test.shape)

x_train.head() # this is to review the columns 
(721, 8) (309, 8)
Out[6]:
cement slag ash water superplastic coarseagg fineagg age
185 350.0000 0.0000 0.0000 203.0000 0.0000 974.0000 775.0000 14
286 374.0000 189.2000 0.0000 170.1000 10.1000 926.1000 756.7000 91
600 277.0000 0.0000 0.0000 191.0000 0.0000 968.0000 856.0000 3
691 380.0000 95.0000 0.0000 228.0000 0.0000 932.0000 594.0000 7
474 356.0000 0.0000 142.0000 193.0000 11.0000 801.0000 778.0000 28
In [15]:
print("{0:0.2f}% data is in training set".format((len(x_train)/len(df.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(df.index)) * 100))
70.00% data is in training set
30.00% data is in test set
In [16]:
x_train.dtypes
Out[16]:
cement          float64
slag            float64
ash             float64
water           float64
superplastic    float64
coarseagg       float64
fineagg         float64
age               int64
dtype: object
  • ## 2.3 Decide on the complexity of the model

I will try to print accuracy on test data using below models:

  • Linear Regression
  • Linear Regression with Polynomial features of degree 2
  • Linear Regression with Polynomial features of degree 3
  • Decision tree
  • Random forest
  • Lasso
  • Lasso with polynomial features of degree 2
  • Lasso with polynomial features of degree 3
  • Ada boosting
  • Gradient boosting
  • KNN
  • Support Vector machines
In [22]:
from sklearn.linear_model import LinearRegression, LogisticRegression,Ridge, Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from yellowbrick.classifier import ClassificationReport, ROCAUC
from sklearn.svm import SVR
In [51]:
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.metrics import roc_curve
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
  • ### 2.3.a) Using all data

The next commands were the 1st trial to get the score of all the models at once. Later on these code improved.

In [39]:
#Linear Regression 
linR = LinearRegression()
linR.fit(x_train, y_train)
pred = linR.predict(x_test)  # Predictions from linear regression
score0_train = linR.score(x_train, y_train)
score0_test = linR.score(x_test, y_test)

#Linear Regression with Polynomial features of degree 2
pipeline= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',LinearRegression())])
pipeline.fit(x_train,y_train)
score1_train = pipeline.score(x_train,y_train) # Predictions from linear regression degree 2
score1_test=  pipeline.score(x_test,y_test)
print('Score Linear Regression degree 2_train', score1_train)
print('Score Linear Regression degree 2_test', score1_test)

#Linear Regression with Polynomial features of degree 2
pipeline= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',LinearRegression())])
pipeline.fit(x_train,y_train)
score2_train = pipeline.score(x_train,y_train) # Predictions from linear regression degree 2
score2_test=  pipeline.score(x_test,y_test)
print('Score Linear Regression degree 3_train', score1_train)
print('Score Linear Regression degree 3_test', score1_test)

#dt = DecisionTree Regressor
dt = DecisionTreeRegressor(random_state=42, max_depth=4) #scale=True)
dt.fit(x_train, y_train)
score3_train= dt.score(x_train, y_train)
score3_test= dt.score(x_test, y_test)
pred_dt = dt.predict(x_test)

#dt = DecisionTree Regressor degree 2
pipeline= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',DecisionTreeRegressor(random_state=7))])
pipeline.fit(x_train,y_train)
score4_train = pipeline.score(x_train,y_train)
score4_test=  pipeline.score(x_test,y_test)
print('Score Dt Regressor degree 2_train', score1_train)
print('Score Dt Regressor degree 2_train', score1_test)


rf = RandomForestRegressor(random_state=42, max_depth=4)
rf.fit(x_train, y_train)
score5_train = rf.score(x_train, y_train)
score5_test= rf.score(x_test, y_test)

                           
ls=Lasso(random_state=42)
ls.fit(x_train, y_train)
pred_ls = ls.predict(x_test)
score6_train=ls.score(x_train, y_train)
score6_test= ls.score(x_test, y_test)

print (score0_train, score1_train, score2_train, score3_train, score4_train, score5_train, score6_train)
print (score0_test,  score1_test,  score2_test,  score3_test,  score4_test,  score5_test, score6_test)
Score Linear Regression degree 2_train 0.8128160328713387
Score Linear Regression degree 2_test 0.790230231217036
Score Linear Regression degree 3_train 0.8128160328713387
Score Linear Regression degree 3_test 0.790230231217036
Score Dt Regressor degree 2_train 0.8128160328713387
Score Dt Regressor degree 2_train 0.790230231217036
0.603148754063023 0.8128160328713387 0.9311497733602642 0.7253199961615051 0.9948592423407845 0.8233452602260005 0.6029347848259372
0.6339136715208276 0.790230231217036 0.8698207339992289 0.6566980986130893 0.8325497579957373 0.7566990762335056 0.6351013982458936

The next command is the improvement of the line above, it is a little more efficient and it shows the table at the end, however it looks like it can be done with a loop.

In [67]:
num_folds = 50
seed = 7

kfold = KFold(n_splits=num_folds, random_state=seed)

#Linear Regression 
model = LinearRegression()
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
results = pd.DataFrame({'Model':['LinearRegression'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = results[['Model', 'score_training','score_test','k_fold_mean', 'k_fold_std', '95% confidence intervals']]


#Linear Regression with Polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 2'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])

#Linear Regression with Polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 3'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])

#dt = DecisionTree Regressor
model = DecisionTreeRegressor(random_state=42, max_depth=4) #scale=True)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])                                    

#dt = DecisionTree Regressor degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',DecisionTreeRegressor(random_state=7))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor degree 2'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# RandomForestRegressor
model = RandomForestRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Random Forest Regressor'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

#Lasso                           
model=Lasso(random_state=42)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])  
                            
# Lasso with polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 2'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# Lasso with polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 3'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# Ada boosting
model = AdaBoostRegressor(n_estimators = 100, learning_rate=0.1, random_state=22)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Ada boosting'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# Gradient boosting
model = GradientBoostingRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Gradient boosting'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

## The next lines are  to select the best model based on the k_fold_mean and add it in the last row

tmp_best = results.sort_values(['k_fold_mean'], ascending=False).head(1)
tmp_best['Model'] = 'Best Model = ' + tmp_best['Model']
results = results.append(tmp_best, ignore_index=True)

## This is the table with the scoring result of the algorithm with all the data

results
Out[67]:
Model score_training score_test k_fold_mean k_fold_std 95% confidence intervals
0 LinearRegression 0.6030 0.6340 0.5460 0.1830 0.187 <-> 0.905
1 LinearRegression degree 2 0.8130 0.7900 0.7520 0.1100 0.536 <-> 0.968
2 LinearRegression degree 3 0.9310 0.8700 0.8400 0.1270 0.591 <-> 1.089
3 DecisionTree Regressor 0.7250 0.6570 0.6390 0.1630 0.32 <-> 0.958
4 DecisionTree Regressor degree 2 0.9950 0.8330 0.8520 0.0870 0.681 <-> 1.023
5 Random Forest Regressor 0.8230 0.7570 0.7420 0.0980 0.55 <-> 0.934
6 Lasso 0.6030 0.6350 0.5460 0.1820 0.189 <-> 0.903
7 Lasso degree 2 0.8000 0.7730 0.7420 0.1080 0.53 <-> 0.954
8 Lasso degree 3 0.8930 0.8480 0.8270 0.1190 0.594 <-> 1.06
9 Ada boosting 0.7820 0.7090 0.7140 0.0960 0.526 <-> 0.902
10 Gradient boosting 0.9730 0.9160 0.9130 0.0510 0.813 <-> 1.013
11 Best Model = Gradient boosting 0.9730 0.9160 0.9130 0.0510 0.813 <-> 1.013

Insight # 10

  • Most of the model are showing overfitting, for instance we can see overfit in: Linear regression degree 2 and 3, Decission Tree Regressor degree 1 and 2
  • ### 2.3.b) After Outliers correction
In [93]:
#1st I will try to count the outliers in the variables as seen in the previous mentored sessions.
#Age was showing in the EDA the bigger ammount of outliers

q1= df.quantile(0.25)
q3= df.quantile(0.75)
IQR = q3-q1
low  = q1-1.5*IQR #acceptable range
high = q3+1.5*IQR #acceptable range

outliers = pd.DataFrame(((df > (high)) | (df < (low))).sum(axis=0), columns=['Total of outliers'])
outliers['% equivalent'] = round(outliers['Total of outliers']*100/len(df), 3)

outliers
Out[93]:
Total of outliers % equivalent
cement 0 0.0000
slag 2 0.1940
ash 0 0.0000
water 9 0.8740
superplastic 10 0.9710
coarseagg 0 0.0000
fineagg 5 0.4850
age 59 5.7280
strength 4 0.3880
In [123]:
q1 = df['age'].quantile(0.25) #first quartile value
q3 = df['age'].quantile(0.75) # third quartile value
iqr = q3-q1 #Interquartile range
low  = q1-1.5*iqr #acceptable range
high = q3+1.5*iqr #acceptable range

df_in = df.loc[(df['age'] >= low) & (df['age'] <= high)] # meeting the acceptable range
df_out = df.loc[(df['age'] < low) | (df['age'] > high)] # not meeting the acceptable range
age_mean=int(df_in.age.mean()) #finding the mean of the acceptable range
print('age_mean =', '' ,age_mean)
print('Shape of df without outliers = ' ,df_in.shape)
print('# of outliers =', df_out.shape)
print('lower rage = ' , low ,'&', ' Hiher rage =' , high)

#imputing outlier values with mean value
df_out.age=age_mean

#getting back the original shape of df
df_rev=pd.concat([df_in,df_out]) #concatenating both dfs to get the original shape
df_rev.shape
print('Shape of original data frame', '' ,df.shape) # original data frame
print('Shape new df ' , ' ' , df_rev.shape) #new df
age_mean =  32
Shape of df without outliers =  (971, 9)
# of outliers = (59, 9)
lower rage =  -66.5 &  Hiher rage = 129.5
Shape of original data frame  (1030, 9)
Shape new df    (1030, 9)
In [124]:
## The code from above is repeated to check the outliers again

q1= df_rev.quantile(0.25)
q3= df_rev.quantile(0.75)
IQR = q3-q1
low  = q1-1.5*IQR #acceptable range
high = q3+1.5*IQR #acceptable range

outliers = pd.DataFrame(((df_rev > (high)) | (df_rev < (low))).sum(axis=0), columns=['Total of outliers'])
outliers['% equivalent'] = round(outliers['Total of outliers']*100/len(df), 3)

outliers
Out[124]:
Total of outliers % equivalent
cement 0 0.0000
slag 2 0.1940
ash 0 0.0000
water 9 0.8740
superplastic 10 0.9710
coarseagg 0 0.0000
fineagg 5 0.4850
age 131 12.7180
strength 4 0.3880
In [129]:
# Here i will replace the outliers of age by "higher acceptable range" due ot using mean the #of outliers increased from 59 to 131

q1 = df['age'].quantile(0.25) #first quartile value
q3 = df['age'].quantile(0.75) # third quartile value
iqr = q3-q1 #Interquartile range
low  = q1-1.5*iqr #acceptable range
high = q3+1.5*iqr #acceptable range

df_in = df.loc[(df['age'] >= low) & (df['age'] <= high)] # meeting the acceptable range
df_out = df.loc[(df['age'] < low) | (df['age'] > high)] # not meeting the acceptable range

print('Shape of df without outliers = ' ,df_in.shape)
print('# of outliers =', df_out.shape)
print('lower rage = ' , low ,'&', ' Hiher rage =' , high)

#imputing outlier values with mean value
df_out.age=high

#getting back the original shape of df
df_rev=pd.concat([df_in,df_out]) #concatenating both dfs to get the original shape
df_rev.shape
print('Shape of original data frame', '' ,df.shape) # original data frame
print('Shape new df ' , ' ' , df_rev.shape) #new df
Shape of df without outliers =  (971, 9)
# of outliers = (59, 9)
lower rage =  -66.5 &  Hiher rage = 129.5
Shape of original data frame  (1030, 9)
Shape new df    (1030, 9)
In [130]:
## checking the outliers again

q1= df_rev.quantile(0.25)
q3= df_rev.quantile(0.75)
IQR = q3-q1
low  = q1-1.5*IQR #acceptable range
high = q3+1.5*IQR #acceptable range

outliers = pd.DataFrame(((df_rev > (high)) | (df_rev < (low))).sum(axis=0), columns=['Total of outliers'])
outliers['% equivalent'] = round(outliers['Total of outliers']*100/len(df), 3)

outliers
Out[130]:
Total of outliers % equivalent
cement 0 0.0000
slag 2 0.1940
ash 0 0.0000
water 9 0.8740
superplastic 10 0.9710
coarseagg 0 0.0000
fineagg 5 0.4850
age 0 0.0000
strength 4 0.3880
In [131]:
q1 = df_rev['superplastic'].quantile(0.25) #first quartile value
q3 = df_rev['superplastic'].quantile(0.75) # third quartile value
iqr = q3-q1 #Interquartile range
low  = q1-1.5*iqr #acceptable range
high = q3+1.5*iqr #acceptable range

df_in = df_rev.loc[(df_rev['superplastic'] >= low) & (df_rev['superplastic'] <= high)] # meeting the acceptable range
df_out = df_rev.loc[(df_rev['superplastic'] < low) | (df_rev['superplastic'] > high)] # not meeting the acceptable range
superplastic_mean=int(df_in.superplastic.mean()) #finding the mean of the acceptable range
print('Superplastic_mean =' ,superplastic_mean)
print('Shape of df without outliers = ' ,df_in.shape)
print('# of outliers =', df_out.shape)
print('lower rage = ' , low ,'&', ' Hiher rage =' , high)

#imputing outlier values with mean value
df_out.superplastic= superplastic_mean

#getting back the original shape of df
df_rev=pd.concat([df_in,df_out]) #concatenating both dfs to get the original shape
df_rev.shape
print('Shape of original data frame', '' ,df_rev.shape) # original data frame
print('Shape new df ' , ' ' , df_rev.shape) #new df
Superplastic_mean = 5
Shape of df without outliers =  (1020, 9)
# of outliers = (10, 9)
lower rage =  -15.299999999999999 &  Hiher rage = 25.5
Shape of original data frame  (1030, 9)
Shape new df    (1030, 9)
In [132]:
## The code from above is repeated to check the outliers again

q1= df_rev.quantile(0.25)
q3= df_rev.quantile(0.75)
IQR = q3-q1
low  = q1-1.5*IQR #acceptable range
high = q3+1.5*IQR #acceptable range

outliers = pd.DataFrame(((df_rev > (high)) | (df_rev < (low))).sum(axis=0), columns=['Total of outliers'])
outliers['% equivalent'] = round(outliers['Total of outliers']*100/len(df), 3)

outliers
Out[132]:
Total of outliers % equivalent
cement 0 0.0000
slag 2 0.1940
ash 0 0.0000
water 9 0.8740
superplastic 0 0.0000
coarseagg 0 0.0000
fineagg 5 0.4850
age 0 0.0000
strength 4 0.3880

In the code below I will run the same as 2.3.b after outlier correction

In [133]:
X = df_rev.drop('strength', axis=1)  # Seperating the target and the rest
Y = df_rev['strength']

##Split into training and test set
x_train, x_test,y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1) # 1 is just any random seed number
print (x_train.shape, x_test.shape)

x_train.head() # this is to review the columns 
(721, 8) (309, 8)
Out[133]:
cement slag ash water superplastic coarseagg fineagg age
200 166.8000 250.2000 0.0000 203.5000 0.0000 975.6000 692.6000 28.0000
309 144.0000 0.0000 175.0000 158.0000 18.0000 943.0000 844.0000 28.0000
644 173.8000 93.4000 159.9000 172.3000 9.7000 1,007.2000 746.6000 14.0000
739 170.3000 155.5000 0.0000 185.7000 0.0000 1,026.6000 724.3000 28.0000
507 251.8000 0.0000 99.9000 146.1000 12.4000 1,006.0000 899.8000 28.0000
In [134]:
num_folds = 50
seed = 7

kfold = KFold(n_splits=num_folds, random_state=seed)

#Linear Regression 
model = LinearRegression()
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
results = pd.DataFrame({'Model':['LinearRegression'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = results[['Model', 'score_training','score_test','k_fold_mean', 'k_fold_std', '95% confidence intervals']]


#Linear Regression with Polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 2'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])

#Linear Regression with Polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 3'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])

#dt = DecisionTree Regressor
model = DecisionTreeRegressor(random_state=42, max_depth=4) #scale=True)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])                                    

#dt = DecisionTree Regressor degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',DecisionTreeRegressor(random_state=7))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor degree 2'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# RandomForestRegressor
model = RandomForestRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Random Forest Regressor'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

#Lasso                           
model=Lasso(random_state=42)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])  
                            
# Lasso with polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 2'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# Lasso with polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 3'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# Ada boosting
model = AdaBoostRegressor(n_estimators = 100, learning_rate=0.1, random_state=22)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Ada boosting'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# Gradient boosting
model = GradientBoostingRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Gradient boosting'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

## The next lines are  to select the best model based on the k_fold_mean and add it in the last row

tmp_best = results.sort_values(['k_fold_mean'], ascending=False).head(1)
tmp_best['Model'] = 'Best Model = ' + tmp_best['Model']
results = results.append(tmp_best, ignore_index=True)

## This is the table with the scoring result of the algorithm with all the data

results
Out[134]:
Model score_training score_test k_fold_mean k_fold_std 95% confidence intervals
0 LinearRegression 0.7280 0.7400 0.6520 0.2420 0.178 <-> 1.126
1 LinearRegression degree 2 0.8690 0.8410 0.8020 0.1250 0.557 <-> 1.047
2 LinearRegression degree 3 0.9490 0.8760 0.7990 0.5590 -0.297 <-> 1.895
3 DecisionTree Regressor 0.7320 0.6610 0.6200 0.2000 0.228 <-> 1.012
4 DecisionTree Regressor degree 2 0.9950 0.8680 0.8460 0.0960 0.658 <-> 1.034
5 Random Forest Regressor 0.8130 0.7570 0.7250 0.1380 0.455 <-> 0.995
6 Lasso 0.7280 0.7380 0.6520 0.2440 0.174 <-> 1.13
7 Lasso degree 2 0.8540 0.8360 0.7890 0.1390 0.517 <-> 1.061
8 Lasso degree 3 0.9120 0.8820 0.8550 0.0760 0.706 <-> 1.004
9 Ada boosting 0.7740 0.7330 0.6970 0.1180 0.466 <-> 0.928
10 Gradient boosting 0.9750 0.9240 0.9140 0.0480 0.82 <-> 1.008
11 Best Model = Gradient boosting 0.9750 0.9240 0.9140 0.0480 0.82 <-> 1.008

Insight # 11

  • Outliers correction reduce the overfit (insight 10). However we see a high overfit in Decission Tree degree2
  • The outliers correction applied here could be reviewed or applied also to the variable water and evalaute the results
  • ### 2.3.c) After dropping variables
In [142]:
# these are the variable ('ash', 'coarseagg', 'fineagg') with less correlation observed during the EDA
# 'Slag' could be dropped and evaluate the model later on. In this project I will not try due to lack of time

print (' Original columns: ',df_rev.columns)

df_rev2=df_rev.drop(['ash', 'coarseagg', 'fineagg'],axis=1)

print (' New columns: ', df_rev2.columns)
 Original columns:  Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength'],
      dtype='object')
 New columns:  Index(['cement', 'slag', 'water', 'superplastic', 'age', 'strength'], dtype='object')
In [143]:
X = df_rev2.drop('strength', axis=1)  # Seperating the target and the rest
Y = df_rev2['strength']

##Split into training and test set
x_train, x_test,y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1) # 1 is just any random seed number
print (x_train.shape, x_test.shape)

x_train.head() # this is to review the columns 
(721, 5) (309, 5)
Out[143]:
cement slag water superplastic age
200 166.8000 250.2000 203.5000 0.0000 28.0000
309 144.0000 0.0000 158.0000 18.0000 28.0000
644 173.8000 93.4000 172.3000 9.7000 14.0000
739 170.3000 155.5000 185.7000 0.0000 28.0000
507 251.8000 0.0000 146.1000 12.4000 28.0000
In [144]:
num_folds = 50
seed = 7

kfold = KFold(n_splits=num_folds, random_state=seed)

#Linear Regression 
model = LinearRegression()
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
results = pd.DataFrame({'Model':['LinearRegression'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = results[['Model', 'score_training','score_test','k_fold_mean', 'k_fold_std', '95% confidence intervals']]


#Linear Regression with Polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 2'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])

#Linear Regression with Polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',LinearRegression())])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['LinearRegression degree 3'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])

#dt = DecisionTree Regressor
model = DecisionTreeRegressor(random_state=42, max_depth=4) #scale=True)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])                                    

#dt = DecisionTree Regressor degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',DecisionTreeRegressor(random_state=7))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['DecisionTree Regressor degree 2'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# RandomForestRegressor
model = RandomForestRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Random Forest Regressor'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

#Lasso                           
model=Lasso(random_state=42)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults])  
                            
# Lasso with polynomial features of degree 2
model= Pipeline ([('poly', PolynomialFeatures(degree=2)),('reg',Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 2'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# Lasso with polynomial features of degree 3
model= Pipeline ([('poly', PolynomialFeatures(degree=3)),('reg',Lasso(random_state=42))])
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Lasso degree 3'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# Ada boosting
model = AdaBoostRegressor(n_estimators = 100, learning_rate=0.1, random_state=22)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Ada boosting'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

# Gradient boosting
model = GradientBoostingRegressor(random_state=42, max_depth=4)
model.fit(x_train, y_train)
cv = cross_val_score(model, X, Y, cv=kfold)
mean = round(cv.mean(), 3)
std = round(cv.std(), 3)
tempresults = pd.DataFrame({'Model':['Gradient boosting'], 
                        'score_training': round(model.score(x_train, y_train),3),
                        'score_test':round(model.score(x_test, y_test),3),
                        'k_fold_mean':mean,
                        'k_fold_std':std,
                         '95% confidence intervals': str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))
                        })
results = pd.concat([results, tempresults]) 

## The next lines are  to select the best model based on the k_fold_mean and add it in the last row

tmp_best = results.sort_values(['k_fold_mean'], ascending=False).head(1)
tmp_best['Model'] = 'Best Model = ' + tmp_best['Model']
results = results.append(tmp_best, ignore_index=True)

## This is the table with the scoring result of the algorithm with all the data

results
Out[144]:
Model score_training score_test k_fold_mean k_fold_std 95% confidence intervals
0 LinearRegression 0.7030 0.7270 0.6310 0.2480 0.145 <-> 1.117
1 LinearRegression degree 2 0.8310 0.8160 0.7580 0.1900 0.386 <-> 1.13
2 LinearRegression degree 3 0.8700 0.8450 0.8130 0.1880 0.445 <-> 1.181
3 DecisionTree Regressor 0.7310 0.6710 0.6220 0.2010 0.228 <-> 1.016
4 DecisionTree Regressor degree 2 0.9940 0.8280 0.8380 0.0910 0.66 <-> 1.016
5 Random Forest Regressor 0.8100 0.7560 0.7230 0.1440 0.441 <-> 1.005
6 Lasso 0.7030 0.7260 0.6300 0.2490 0.142 <-> 1.118
7 Lasso degree 2 0.8260 0.8070 0.7550 0.1790 0.404 <-> 1.106
8 Lasso degree 3 0.8700 0.8560 0.8070 0.1370 0.538 <-> 1.076
9 Ada boosting 0.7800 0.7430 0.6880 0.1290 0.435 <-> 0.941
10 Gradient boosting 0.9680 0.9210 0.9010 0.0670 0.77 <-> 1.032
11 Best Model = Gradient boosting 0.9680 0.9210 0.9010 0.0670 0.77 <-> 1.032

Insight # 12

  • Dropping the chosen variable didn't improved the results and actually the score of the Gradient Boosting went down
  • Gradient boosting proved to be the best model between the chosen for this project
  • ### Gradient boosting is the model that best predict the strength of concrete based on the given dataset and the studied models

  • ### In the following steps I will try to tune the parameters of Gradient boosting

  • ### Since the drop of the variable didn't improved the model (insight #12) we will use the df_rev which is after correcting the outliers

In [145]:
X = df_rev.drop('strength', axis=1)  # Seperating the target and the rest
Y = df_rev['strength']

##Split into training and test set
x_train, x_test,y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1) # 1 is just any random seed number
print (x_train.shape, x_test.shape, y_train.shape, y_test.shape)

x_train.head() # this is to review the columns 
(721, 8) (309, 8) (721,) (309,)
Out[145]:
cement slag ash water superplastic coarseagg fineagg age
200 166.8000 250.2000 0.0000 203.5000 0.0000 975.6000 692.6000 28.0000
309 144.0000 0.0000 175.0000 158.0000 18.0000 943.0000 844.0000 28.0000
644 173.8000 93.4000 159.9000 172.3000 9.7000 1,007.2000 746.6000 14.0000
739 170.3000 155.5000 0.0000 185.7000 0.0000 1,026.6000 724.3000 28.0000
507 251.8000 0.0000 99.9000 146.1000 12.4000 1,006.0000 899.8000 28.0000
In [148]:
from sklearn.model_selection import RandomizedSearchCV
GradientBoostingRegressor
# Prepare parameter grid
# Please notice: due to the lack of time to complet the project I took the parameters configuration below from internet,
# I checked different source and they had similar parameters with similar values.
# In real life I would need to learn more about this algorithm and the use of each parameters.

parameters = {
    'criterion': ['mse', 'mae', 'friedman_mse'], 
    'learning_rate': [0.05, 0.1, 0.15, 0.2], 
    'max_depth': [2, 3, 4, 5], 
    'max_features': ['sqrt', None], 
    'max_leaf_nodes': list(range(2, 10)),
    'n_estimators': list(range(50, 500, 50)),
    'subsample': [0.8, 0.9, 1.0]
     }
In [150]:
rs = RandomizedSearchCV(estimator=GradientBoostingRegressor(random_state=42), param_distributions=parameters, 
                 return_train_score= True, n_jobs=-1, verbose=2, cv = 10, n_iter=500)
rs.fit(x_train, y_train)
Fitting 10 folds for each of 500 candidates, totalling 5000 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   15.2s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   38.3s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 1021 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 1466 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 1993 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 2632 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 3321 tasks      | elapsed:  8.4min
[Parallel(n_jobs=-1)]: Done 4090 tasks      | elapsed: 10.6min
[Parallel(n_jobs=-1)]: Done 4965 tasks      | elapsed: 12.8min
[Parallel(n_jobs=-1)]: Done 5000 out of 5000 | elapsed: 12.9min finished
Out[150]:
RandomizedSearchCV(cv=10, estimator=GradientBoostingRegressor(random_state=42),
                   n_iter=500, n_jobs=-1,
                   param_distributions={'criterion': ['mse', 'mae',
                                                      'friedman_mse'],
                                        'learning_rate': [0.05, 0.1, 0.15, 0.2],
                                        'max_depth': [2, 3, 4, 5],
                                        'max_features': ['sqrt', None],
                                        'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8,
                                                           9],
                                        'n_estimators': [50, 100, 150, 200, 250,
                                                         300, 350, 400, 450],
                                        'subsample': [0.8, 0.9, 1.0]},
                   return_train_score=True, verbose=2)
In [151]:
mean = rs.best_score_
std = rs.cv_results_['mean_test_score'].std()

print(f"Mean training score: {rs.cv_results_['mean_train_score'].mean()}")
print(f"Mean validation score: {mean}")
print(f"Validation standard deviation: {std}")
print(f"95% confidence interval: {str(round(mean-(1.96*std),3)) + ' <-> ' + str(round(mean+(1.96*std),3))}")
print(f"Best parameters: {rs.best_params_}")
print(f"Test score: {rs.score(x_test, y_test)}")
Mean training score: 0.9333187091972943
Mean validation score: 0.9268818291433826
Validation standard deviation: 0.052603842863668336
95% confidence interval: 0.824 <-> 1.03
Best parameters: {'subsample': 1.0, 'n_estimators': 450, 'max_leaf_nodes': 9, 'max_features': 'sqrt', 'max_depth': 5, 'learning_rate': 0.05, 'criterion': 'mse'}
Test score: 0.9415891615558951

The best model is Gradient boosting having a high and similar score for the training and test data

Using this model is possible to predict the strength accurately between 82% to 100% with 95% confidence.

  • The codes used in the steps 2.3 a), b) and c) could be created in a loop and it would be more efficient. For this project a tried to do it without success. I would need more time to figure out how to organize the loop.
  • The same apply for the outliers correction, it could be donw with a loop wich is more less time consuming but it need more practice to be able to create it.
  • Dropping variable didn't improve the results, however only 1 set of variables were dropped in this project, more trials could be done in this matter.
  • It was confirmed that correcting the outliers with the mean could actually create more outliers. Also the outliers corrections helped to improve the overfit of some models.
  • In this project the pandas report was used to confirmed the observations from the EDA. It is a usefull tool that can be used in future projecst as well as in the real life work.
  • This final version of this project will be kept as a template for the future in order to be able to solve some problems much faster.
  • Along the process of solving this project different observations were highlighted (insights) and added in the index in order to find them quickly.